Abstract

We describe the key role played by partial evaluation in the Supercomputer Toolkit, a parallel computing 
system for scientific applications that effectively exploits the vast amount of parallelism exposed by partial 
evaluation. The Supercomputer Toolkit parallel processor and its associated partial evaluation-based 
compiler have been used extensively by scientists at M.I.T., and have made possible recent results in 
astrophysics showing that the motion of the planets in our solar system is chaotically unstable. 

1 Introduction

Previous work has shown that partial evaluation is good at breaking down
data abstraction and exposing underlying fine-grain parallelism in a
program [4]. We have written a novel compiler that couples partial
evaluation with static scheduling techniques to exploit this fine-grain
parallelism by automatically mapping it onto a coarse-grain parallel
architecture.

Partial evaluation eliminates the barriers to parallel execution imposed
by the data representation and the control structure of a program by
taking advantage of information about the particular problem a program
will be used to solve. For example, partial evaluation is able to perform
at compile time most data structure references, procedure calls, and
conditional branches related to data structure size, leaving mostly
numerical computations to be performed at run time. Partial evaluation is
particularly effective on numerically-oriented scientific programs, since
they tend to be mostly data-independent, meaning that they contain large
regions in which the operations to be performed do not depend on the
numerical values of the data being manipulated. For instance, matrix
multiplication performs the same set of operations, regardless of the
particular numerical values of the matrix elements. We use partial
evaluation to produce huge basic blocks from these data-independent
numerical regions. These basic blocks often contain thousands of
instructions, two orders of magnitude larger than the basic blocks that
typically arise in high-level language programs. To benefit from the
fine-grain parallelism contained in these huge basic blocks, we schedule
the partially-evaluated program for parallel execution primarily by
performing the operations within an individual basic block in parallel.

In order to automatically map the freshly derived fine-grain parallelism
onto a multiprocessor, we developed a technique which coarsens the
dataflow graph by selectively aggregating operations together. This
technique uses heuristics which take the communication bandwidth,
inter-processor communication latency, and processor architecture all
into consideration. High inter-processor communication latency requires
that there be enough parallelism available to allow each processor to
continue to initiate operations, even while waiting for results produced
elsewhere to arrive. Limited communication bandwidth severely restricts
the parallelism grain size that may be utilized by requiring that most
values used by a processor be produced on that processor, rather than
being received from another processor. Our approach addresses these
problems by tailoring the grain size adjustment and scheduling heuristics
to match the communication capabilities of the target architecture.

Our compiler operates in four major phases. The first phase performs
partial evaluation, followed by traditional compiler optimizations, such
as constant folding and dead-code elimination. The second phase analyzes
locality constraints within each basic block, locating operations that
depend so closely on one another that it is clearly desirable that they
be computed on the same processor. These closely related operations are
grouped together to form a higher grain size instruction, known as a
region. The third compilation phase uses heuristic scheduling techniques
to assign each region to a processor. The final phase schedules the
individual operations for execution within each processor, accounting for
pipelining, memory access restrictions, register allocation, and final
allocation of the inter-processor communication pathways.

Figure 1: Parallelism profile of the 9-body problem. This graph
represents all of the parallelism available in the problem, taking into
account the varying latency of numerical operations.

The target architecture of our compiler is the Supercomputer Toolkit, a
parallel processor consisting of eight independent VLIW processors
connected to each other by two shared communication buses [6].
Performance measurements of actual compiled programs running on the
Supercomputer Toolkit show that the code produced by our compiler for an
important astrophysics application [19] runs 6.2 times faster on an
eight-processor system than does near-optimal code executing on a single
processor. The compilation process of this real-world application is used
as an example throughout this paper.

2 The Partial Evaluator 

Partial evaluation converts a high-level, abstractly written, general
purpose program into a low-level program that is specialized for the
particular application at hand. For instance, a program that computes
force interactions among a system of N particles might be specialized to
compute the gravitational interactions among 5 planets of our particular
solar system. This specialization is achieved by performing in advance,
at compile time, all operations that do not depend explicitly on the
actual numerical values of the data.
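
As an illustration only (this is not the Toolkit compiler's code), the
following Python sketch specializes a general dot-product routine by
running it on symbolic placeholders. The names Sym, dot, and
specialize_dot are hypothetical. The loop, the indexing, and the length
test all execute at specialization time; only the arithmetic on unknown
values is recorded, and a single algebraic simplification (adding zero)
stands in for the constant folding mentioned in the introduction.

```python
class Sym:
    """A value known only at run time (hypothetical helper class)."""
    trace = []  # the residual straight-line program being recorded

    def __init__(self, name):
        self.name = name

    def _emit(self, op, left, right):
        # Record one run-time operation and return a placeholder for its result.
        dest = Sym(f"t{len(Sym.trace)}")
        Sym.trace.append((dest.name, op, _show(left), _show(right)))
        return dest

    def __add__(self, other):
        return self if _is_zero(other) else self._emit("+", self, other)

    def __radd__(self, other):
        return self if _is_zero(other) else self._emit("+", other, self)

    def __mul__(self, other):
        return self._emit("*", self, other)

    def __rmul__(self, other):
        return self._emit("*", other, self)


def _is_zero(x):
    return isinstance(x, (int, float)) and x == 0


def _show(x):
    return x.name if isinstance(x, Sym) else repr(x)


def dot(xs, ys):
    """General-purpose program: its control flow depends only on the sizes."""
    total = 0.0
    for x, y in zip(xs, ys):
        total = total + x * y
    return total


def specialize_dot(n):
    """Symbolically execute dot() for vectors of length n."""
    Sym.trace = []
    xs = [Sym(f"x{i}") for i in range(n)]
    ys = [Sym(f"y{i}") for i in range(n)]
    result = dot(xs, ys)
    return list(Sym.trace), result.name


if __name__ == "__main__":
    program, result = specialize_dot(3)
    for dest, op, left, right in program:
        print(f"{dest} = {left} {op} {right}")
    print("result in", result)
```

The residual program contains no loop and no data structure accesses, and
its multiplications are mutually independent, which is the kind of
fine-grain parallelism the later phases exploit.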

Many data structure references, procedure calls, conditional branches,
table lookups, loop iterations, and even some numerical operations may be
performed in advance, at compile time, leaving only the underlying
numerical operations to be performed at run time.

Figure 2: Parallelism profile of the 9-body problem after operations have
been grouped together to form regions. Comparison with Figure 1 clearly
shows that increasing the grain size significantly reduced the
opportunities for parallel execution. The maximum speedup factor dropped
from 69 to 49 times faster than a single processor execution.

Our compiler exposes fine-grain parallelism using a simple partial
evaluation strategy based on a symbolic execution technique described in
[4, 5].¹ Despite this technique’s simplicity, it works well at exposing
fine-grain parallelism. Figure 1 illustrates a parallelism profile
analysis of the nine-body gravitational attraction problem of the type
discussed in [19].² Partial evaluation exposed so much low-level
parallelism that in theory, parallel execution could speed up the
computation by a factor of 69 over a uniprocessor.
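
The text does not spell out how the profile is constructed. One common
reading, assumed here, is an as-soon-as-possible schedule on an unlimited
number of processors: each operation starts the moment its operands are
ready, the profile counts the operations in flight at each cycle, and the
theoretical speedup is total work divided by the critical-path length. The
function name and the small example graph below are hypothetical.

```python
def parallelism_profile(latency, deps):
    """latency: {op: cycles}; deps: {op: [ops whose results it needs]}."""
    start = {}

    def asap(op):
        # Earliest start: when the slowest operand has finished.
        if op not in start:
            start[op] = max((asap(d) + latency[d] for d in deps.get(op, [])),
                            default=0)
        return start[op]

    for op in latency:
        asap(op)

    span = max(start[op] + latency[op] for op in latency)  # critical path
    profile = [0] * span
    for op in latency:
        for cycle in range(start[op], start[op] + latency[op]):
            profile[cycle] += 1

    speedup = sum(latency.values()) / span
    return profile, speedup


if __name__ == "__main__":
    # Three independent multiplies (2 cycles each) feeding two adds.
    latency = {"m0": 2, "m1": 2, "m2": 2, "s0": 1, "s1": 1}
    deps = {"s0": ["m0", "m1"], "s1": ["s0", "m2"]}
    print(parallelism_profile(latency, deps))
```

Under this reading, a figure such as 69 is an upper bound that ignores
communication cost and the finite number of processors.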

¹ More complex partial evaluation strategies that address data-dependent
computations may be found in [10, 11, 12].

² Specifically, one time-step of a 12th-order Stormer integration of the
gravity-induced motion of a 9-body solar system.

3 Adjusting the Grain Size

Searching for an optimal schedule for a program which exploits fine-grain
parallelism is both computationally expensive and difficult to achieve.
Rather than do an exhaustive search for the optimal schedule, we
developed a heuristic technique to coarsen the exposed fine-grain
parallelism to a grain size suitable for critical-path based static
scheduling. Prior to initiating critical-path based scheduling, we
perform locality analysis that groups together operations that depend so
closely on one another that it would not be practical to place them in
different processors. Each group of closely interdependent operations
forms a larger grain size macro-instruction, which we refer to as a
region.³ Some regions are large, while others may be as small as one
fine-grain instruction. In essence, grouping operations together to form
a region is a way of simplifying the scheduling process by deciding in
advance that certain opportunities for parallel execution will be ignored
due to limited communication capabilities.

Since operations within a region will occur on the same processor, the
maximum region size must be chosen to match the communication
capabilities of the target architecture. For instance, if regions are
permitted to grow too large, a single region might encompass the entire
data-flow graph, forcing the entire computation to be performed on a
single processor! Although strict limits are therefore placed on the
maximum size of a region, regions need not be of uniform size. Indeed,
some regions will be large, corresponding to localized computation of
intermediate results, while others will be quite small, corresponding to
results that are used globally throughout the computation.

We have experimented with several different heuristics for grouping
operations into regions. The optimal strategy for grouping instructions
into regions varies with the application and with the communication
limitations of the target architecture. However, we have found that even
a relatively simple grain size adjustment strategy dramatically improves
the performance of the scheduling process. As illustrated in Figure 3,
when a value is used by only one instruction, the producer and consumer
of that value may be grouped together to form a region, thereby ensuring
that the scheduler will not place the producer and consumer on different
processors in an attempt to use spare cycles wherever they happen to be
available. Provided that the maximum region size is chosen
appropriately,⁴ grouping operations together based on locality prevents
the scheduler from making gratuitous use of the communication channels,
forcing it to focus on scheduling options that make more effective use of
the limited communication bandwidth.

Figure 3: A Simple Region Forming Heuristic. A region is formed by
grouping together operations that have a simple producer/consumer
relationship. This process is invoked repeatedly, with the region growing
in size as additional producers are added. The region-growing process
terminates when no suitable producers remain, or when the maximum region
size is reached. A producer is considered suitable to be included in a
region if it produces its result solely for use by that region. (The
numbers shown within each node reflect the computational latency of the
operation.)

An important aspect of grain size adjustment is that the grain size is
not increased uniformly. As shown in Table 1, some regions are much
larger than others. Indeed, it is important not to forcibly group
non-localized operations into regions simply to increase the grain size.
For example, it is likely that the result produced by an instruction that
has many consumers will be transmitted amongst the processors, since it
is not practical to place all of the consumers on the result-producing
processor. In this case, creating a large region by grouping together the
producer with only some of the consumers increases the grain size, but
does not reduce inter-processor communication, since the result would
need to be transmitted anyway. In other words, it only makes sense to
limit the scheduler’s options by grouping operations together when doing
so will clearly reduce inter-processor communication.

³ The name region was chosen because we think of the grain size
adjustment technique as identifying “regions” of locality within the
data-flow graph. The process of grain size adjustment is closely related
to the problem of graph multisection, although our region-finder is
somewhat more particular about the properties (shape, size, and
connectivity) of each “region” sub-graph than are typical graph
multisection algorithms.

⁴ The region size must be chosen such that the computational latency of
the operations grouped together is well-matched to the communication
bandwidth limitations of the architecture. If the regions are made too
large, communication bandwidth will be underutilized since the operations
within a region do not transmit their results.
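
The region-forming heuristic of Figure 3 can be sketched in Python as
follows. This is one possible reading of the caption, not the compiler's
region-finder: the function name, the choice of seeds (operations are
visited consumers-first), and the example graph are assumptions. A
producer joins a region only when every consumer of its result is already
in that region and the summed latency stays within max_size.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def form_regions(latency, deps, max_size):
    """latency: {op: cycles}; deps: {op: [producers it consumes]}."""
    consumers = {op: set() for op in latency}
    for op, producers in deps.items():
        for p in producers:
            consumers[p].add(op)

    # static_order() lists producers first; reversing it visits every
    # consumer before any of its producers.
    graph = {op: deps.get(op, []) for op in latency}
    order = list(TopologicalSorter(graph).static_order())

    assigned = set()
    regions = []
    for seed in reversed(order):
        if seed in assigned:
            continue
        region, size = {seed}, latency[seed]
        grew = True
        while grew:  # keep absorbing suitable producers
            grew = False
            for member in list(region):
                for p in deps.get(member, []):
                    suitable = (p not in region and p not in assigned
                                and consumers[p] <= region)
                    if suitable and size + latency[p] <= max_size:
                        region.add(p)
                        size += latency[p]
                        grew = True
        assigned |= region
        regions.append(region)
    return regions


if __name__ == "__main__":
    latency = {"m0": 2, "m1": 2, "s": 1, "g": 1, "u": 2, "v": 1}
    deps = {"s": ["m0", "m1"], "u": ["s", "g"], "v": ["g"]}
    print(form_regions(latency, deps, max_size=10))
```

In the example, g feeds both u and v, so it is left as a region of its
own: grouping it with only one consumer would enlarge the region without
saving a transmission, which is the point made in the last paragraph
above.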

4 Parallel Scheduling 

Exploiting locality by grouping operations into regions forces
closely-related operations to occur on the same processor. Although this
reduces inter-processor communication requirements, it also eliminates
many opportunities for parallel execution. Figure 2 shows the parallelism
remaining in the 9-body problem after operations have been grouped into
regions. Comparison with Figure 1 shows that increasing the grain size
eliminates about half of the opportunities for parallel execution. The
challenge facing the parallel scheduler is to make effective use of the
limited parallelism that remains, while taking into consideration such
factors as communication latency, memory traffic, pipeline delays, and
allocation of resources such as processor buses and inter-processor
communication channels.

    Region Size    Number of
    (cycles)       Regions
    -----------    ---------
          1           108
          2            28
          3            28
          5            56
          6             1
          7             8
         14            36
         41            24
         43             3

Table 1: The numerical operations in the 9-body program were divided into
regions based on locality. This table shows how region size can vary
depending on the locality structure of the computation. Region size is
measured by computational latency (cycles). The program was divided into
292 regions, with an average region size of 7.56 cycles. The maximal
region size used was 43 cycles.

Our compiler schedules operations for parallel execution in two phases.
The first phase, known as the region-level scheduler, is primarily
concerned with coarse-grain assignment of regions to processors,
generating a rough outline of what the final program will look like. The
region-level scheduler assigns each region to a processor; determines the
source, destinations, and approximate time of transmission of each
inter-processor message; and determines the preferred order of execution
of the regions assigned to each processor. The region-level scheduler
takes into account the latency of numerical operations, the
inter-processor communication capabilities of the target architecture,
the structure (critical path) of the computation, and which data values
each processor will store in its memory. The region-level scheduler does
not concern itself with finer-grain details such as the pipeline
structure of the processors, the detailed allocation of each
communication channel, or the ordering of individual operations within a
processor. At the coarse grain size associated with the scheduling of
regions, a straightforward set of critical-path based scheduling
heuristics⁵ has proven quite effective. For the 9-body problem example,
the computational load was spread so evenly that the variation in
utilization efficiency among the 8 processors was only one percent.
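
The actual region-level heuristics are described in [1]. As a rough
illustration of critical-path based list scheduling in general, the
Python sketch below repeatedly picks the ready region with the longest
remaining path and places it on the processor where it can start
earliest, charging a fixed delay when an operand comes from another
processor. The function name, the single comm_latency parameter, and the
tie-breaking rules are assumptions; bus bandwidth and memory placement
are not modeled.

```python
def schedule_regions(latency, deps, n_procs, comm_latency):
    """Return {region: (processor, start, finish)}."""
    consumers = {r: [] for r in latency}
    for r, producers in deps.items():
        for p in producers:
            consumers[p].append(r)

    path = {}

    def critical_path(r):
        # Longest latency-weighted path from r to the end of the graph.
        if r not in path:
            path[r] = latency[r] + max(
                (critical_path(c) for c in consumers[r]), default=0)
        return path[r]

    proc_free = [0] * n_procs
    placed = {}
    remaining = set(latency)
    while remaining:
        ready = [r for r in remaining
                 if all(p in placed for p in deps.get(r, []))]
        region = max(ready, key=lambda r: (critical_path(r), r))
        best = None
        for proc in range(n_procs):
            # Operands produced elsewhere arrive comm_latency cycles late.
            data_ready = max(
                (placed[p][2] + (0 if placed[p][0] == proc else comm_latency)
                 for p in deps.get(region, [])), default=0)
            start = max(proc_free[proc], data_ready)
            if best is None or start < best[1]:
                best = (proc, start)
        proc, start = best
        placed[region] = (proc, start, start + latency[region])
        proc_free[proc] = start + latency[region]
        remaining.remove(region)
    return placed


if __name__ == "__main__":
    latency = {"A": 4, "B": 4, "C": 2}
    deps = {"C": ["A", "B"]}
    print(schedule_regions(latency, deps, n_procs=2, comm_latency=3))
```

In the example, A and B run in parallel on different processors, and C
must wait for one operand to cross the interconnect, which is the cost
that grain size adjustment tries to keep off the critical path.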

The final phase of the compilation process is instruction-level
scheduling. The region-level scheduler provides the instruction-level
scheduler with an ordered list of regions to execute on each processor
along with a list of results that need to be transmitted when they are
computed. The instruction-level scheduler chooses the final ordering of
low-level operations within each processor, taking into account processor
pipelining, register allocation, memory access restrictions, and
availability of inter-processor communication channels. Whenever
possible, the order of operations is chosen so as to match the
preferences of the region-level scheduler, represented by the ordered
list of regions. However, the instruction-level scheduler is free to
reorder operations as needed, intertwining operations among the regions
assigned to a particular processor, without regard to which coarse-grain
region they were originally a member of. This strategy allows the
instruction scheduler to maintain a schedule similar to the one suggested
by the region scheduler, thereby ensuring that the results will be
produced at approximately the time that other processors are expecting
them, while still taking advantage of fine-grain parallelism available in
other regions to fill pipeline slots as needed.

⁵ The heuristics used by the region-level scheduler are closely related
to list-scheduling [14]. A detailed discussion of the heuristics used by
the region-level scheduler is presented in [1].

The instruction-level scheduler derives low-level pipelined instructions
for each processor, choosing the exact time and communication channel for
each inter-processor transmission, and determining where values will be
stored within each processor. The instruction-level scheduling process
begins with a data-use analysis that determines which instructions share
data values and should therefore be placed near each other for register
allocation purposes. This data-use information is combined with the
higher-level ordering preferences expressed by the region-level
scheduler, producing a scheduling priority for each instruction. The
instruction scheduling process then proceeds one cycle at a time,
performing scheduling of that cycle on all processors before moving on to
the next cycle. Instructions compete for resources based on their
scheduling priority; in each cycle, the highest-priority operation whose
data and processor resources are available will be scheduled. This
competition for data and resources helps to keep each processor busy, by
scheduling low-priority operations whose resources are available whenever
the resources for higher priority computations are not available. Indeed,
when the performance of the instruction scheduler is measured
independently of the region-level scheduler, by generating code for a
single Supercomputer Toolkit VLIW processor, utilization efficiencies in
excess of 99.7% are routinely achieved, representing nearly optimal code.
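
The cycle-by-cycle competition described above can be illustrated with a
deliberately small Python model. It assumes a single processor that
issues a fixed number of operations per cycle and whose results become
available a fixed number of cycles after issue; the function name, the
issue_width parameter, and the priority values are hypothetical, and
registers, buses, and multiple processors are left out.

```python
def schedule_cycles(latency, deps, priority, issue_width=1):
    """Return a list whose entry i holds the operations issued in cycle i."""
    finish = {}          # op -> cycle at which its result becomes available
    pending = set(latency)
    schedule = []
    cycle = 0
    while pending:
        # An operation is ready once all of its operands have been produced.
        ready = [op for op in pending
                 if all(d in finish and finish[d] <= cycle
                        for d in deps.get(op, []))]
        # Highest priority first; names break ties deterministically.
        ready.sort(key=lambda op: (-priority[op], op))
        issued = ready[:issue_width]
        for op in issued:
            finish[op] = cycle + latency[op]
            pending.remove(op)
        schedule.append(issued)
        cycle += 1
    return schedule


if __name__ == "__main__":
    latency = {"a": 3, "b": 1, "c": 1}
    deps = {"c": ["a"]}
    priority = {"a": 3, "c": 2, "b": 1}
    print(schedule_cycles(latency, deps, priority))
```

In the example, the low-priority operation b is issued in the second
cycle because the higher-priority c is still waiting for a, so the slot
is not wasted.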

An aspect of the scheduler that has proven to be particularly important
is the retroactive scheduling of memory references. Although computation
instructions (such as + or *) are scheduled on a cycle-by-cycle basis,
memory LOAD instructions are scheduled retroactively, wherever they
happen to fit in. For instance, when a computation instruction requires
that a value be loaded into a register from memory, the actual memory
access operation⁶ is scheduled in the past for the earliest moment at
which both a register and a memory-bus cycle are available; the memory
operation may occur fifty or even one hundred instructions earlier than
the computation instruction. Supercomputer Toolkit memory operations must
compete for bus access with inter-processor messages, so retroactive
scheduling of memory references helps to avoid interference between
memory and communication traffic. Figure 4 illustrates the effectiveness
of the instruction-level scheduler on the nine-body problem example.
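
A minimal Python sketch of the retroactive placement rule, under
simplifying assumptions that are not taken from the Toolkit: one memory
bus whose occupied cycles are kept in a set, registers described only by
the cycle from which each is free, and a fixed load latency. The name
place_load and its parameters are hypothetical.

```python
def place_load(use_cycle, bus_busy, reg_free_from, load_latency=1):
    """Place a LOAD for a value consumed at use_cycle.

    bus_busy:      set of cycles in which the memory bus is already taken
    reg_free_from: list; entry r is the first cycle at which register r
                   is free
    Returns (cycle, register), or None if no slot exists.
    """
    latest = use_cycle - load_latency
    # Scan the already-scheduled past for the earliest workable slot.
    for cycle in range(0, latest + 1):
        if cycle in bus_busy:
            continue
        for reg, free_from in enumerate(reg_free_from):
            if free_from <= cycle:
                bus_busy.add(cycle)
                # The register holds the value until it has been consumed.
                reg_free_from[reg] = use_cycle + 1
                return cycle, reg
    return None


if __name__ == "__main__":
    bus_busy = {0, 1, 2}
    reg_free_from = [5, 3]
    print(place_load(10, bus_busy, reg_free_from))
```

Placing the load as early as possible leaves later bus cycles free, which
matters when memory operations and inter-processor messages share the
same bus.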


⁶ On the Toolkit architecture, two memory operations may occur in
parallel with computation and address-generation operations. This ensures
that retroactively scheduled memory accesses will not interfere with
computations from previous cycles that have already been scheduled.

Figure 4: The result of scheduling the 9-body problem onto 8
Supercomputer Toolkit processors. Comparison with the region-level
parallelism profile (Figure 2) illustrates how the scheduler spread the
coarse-grain parallelism across the processors. A total of 340 cycles are
required to complete the computation. On average, 6.5 of the 8 processors
are utilized during each cycle.

5 Performance Measurements 

The Supercomputer Toolkit and our associated compiler have been used for
a wide variety of applications, ranging from computation of human genetic
pedigrees to the simulation of electrical circuits. The applications that
have generated the most interest from the scientific community involve
various integrations of the N-body gravitational attraction problem.⁷
Parallelization of these integrations has been previously studied by
Miller [18], who parallelized the program by using futures to manually
specify how parallel execution should be attained. Miller shows how one
can rewrite the N-body program so as to eliminate sequential data
structure accesses to provide more effective parallel execution, manually
performing some of the optimizations that partial evaluation provides
automatically. Others have developed special-purpose hardware that
parallelizes the 9-body problem by dedicating one processor per planet
[17]. Previous work in partial evaluation [3, 4, 5] has shown that the
9-body problem contains large amounts of fine-grain parallelism,
suggesting that more subtle parallelizations are possible without the
need to dedicate one processor to each planet.

We have measured the effectiveness of coupling partial evaluation with
grain size adjustment to generate code for the Supercomputer Toolkit
parallel computer, an architecture that suffers from serious
inter-processor communication latency and bandwidth limitations. Table 2
shows the parallel speedups achieved by our compiler for several
different N-body interaction applications. Figure 5 focuses on the 9-body
program (ST9) discussed earlier in this paper, illustrating how the
parallel speedup varies with the number of processors used. Note that as
the number of processors increases beyond 10, the speedup curves level
off. A more detailed analysis has revealed that this is due to the
saturation of the inter-processor communication pathways, as illustrated
in Figure 6. The accuracy of these results was verified by executing the
9-body program on the actual Supercomputer Toolkit hardware in an
eight-processor configuration.

[Plot: speedup versus number of processors]

Figure 5: Speedup graph of Stormer integrations. Ample speedups are
available to keep the 8-processor Supercomputer Toolkit busy. However,
the incremental improvement of using more than 10 processors is
relatively small.

⁷ For instance, [19] describes results obtained using the Supercomputer
Toolkit that prove that the solar system’s dynamics are chaotic.

An important drawback to the partial evaluation approach is that it
results in the unrolling of loops, which can potentially lead to an
explosion in the size of the compiled program. We have found that
depending on the size of the data set being manipulated, partial
evaluation may reduce the overall size of the program, by eliminating
data accesses, branches, and abstraction-manipulation code; or partial
evaluation may increase the size of the program by iterating over a large
data set. The key to making successful use of the partial evaluation
technique is to not carry it too far. For relatively small applications,
such as the 9-body integration program, it was practical to
partially-evaluate the entire computation; on the other hand, if one were
simulating a galaxy containing millions of stars, it would probably be
best not to partially-evaluate some of the outermost loops! Our work
focuses on achieving efficient parallel execution of the
partially-evaluated segments of a program, leaving the decision of which
portions of a program should be subjected to this compilation technique
up to the programmer.

Figure 6: Utilization of the inter-processor communication pathways. The
communication system becomes saturated at around 10 processors. This
accounts for the lack of incremental improvement available from using
more than 10 processors that was seen in Figure 5.

    Program    Single Processor    Eight Processors    Speedup
               Cycles              Cycles
    -------    ----------------    ----------------    -------
    ST6              5811                 954             6.1
    ST9             11042                1785             6.2
    ST12            18588                3095             6.0
    RK9              6329                1228             5.2

Table 2: Speedups of various applications running on 8 processors. Four
different computations have been compiled in order to measure the
performance of the compiler: a 6 particle Stormer integration (ST6), a 9
particle Stormer integration (ST9), a 12 particle Stormer integration
(ST12), and a 9 particle fourth-order Runge-Kutta integration (RK9).
Speedup is the single processor execution time of the computation divided
by the total execution time on the multiprocessor.
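
The speedup column of Table 2 follows directly from the two cycle counts
using the definition in the caption. The short Python check below
recomputes it; the parallel efficiency (speedup divided by the number of
processors) is a derived quantity that does not appear in the table and
is included only for illustration.

```python
# Cycle counts copied from Table 2: (single processor, eight processors).
CYCLES = {
    "ST6": (5811, 954),
    "ST9": (11042, 1785),
    "ST12": (18588, 3095),
    "RK9": (6329, 1228),
}
PROCESSORS = 8


def speedup(single_cycles, parallel_cycles):
    """Single-processor time divided by multiprocessor time."""
    return single_cycles / parallel_cycles


if __name__ == "__main__":
    for name, (single, parallel) in CYCLES.items():
        s = speedup(single, parallel)
        print(f"{name}: speedup {s:.1f}, efficiency {s / PROCESSORS:.0%}")
```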

6 Related Work 

The use of partial evaluation to expose parallelism makes our approach to
parallel compilation fundamentally different from the approaches taken by
other compilers. Traditionally, compilers have maintained the data
structures and control structure of the original program. For example, if
the original program represents an object as a doubly-linked list of
numbers, the compiled program would as well. Only through partial
evaluation can the data structures used by the programmer to think about
the problem be removed, leaving the compiler free to optimize the
underlying numerical computation, unhindered by sequentially-accessed
data structures and procedure calls. However, the drawback to the
partial-evaluation approach is that it is highly effective only for
applications that are mostly data-independent.

Many compilers for high-performance architectures use program
transformations to exploit low-level parallelism. For instance, compilers
for vector machines unroll loops to help fill vector registers. Other
parallelization techniques include trace-scheduling, software pipelining,
vectorizing, and static and dynamic scheduling of data-flow graphs.

6.1 Trace Scheduling 

Compilers that exploit fine-grain parallelism often employ
trace-scheduling techniques [15] to guess which way a branch will go,
allowing computations beyond the branch to occur in parallel with those
that precede the branch. Our approach differs in that we use partial
evaluation to take advantage of information about the specific
application at hand, allowing us to totally eliminate many
data-independent branches, producing basic blocks on the order of several
thousands of instructions, rather than the ten to thirty instructions
typically encountered by trace-scheduling based compilers. An interesting
direction for future work would be to add trace-scheduling to our
approach, to optimize across the data-dependent branches that occur at
basic block boundaries.

Most trace-scheduling based compilers use a variant of list-scheduling
[14] to parallelize operations within an individual basic block. Although
list-scheduling using critical-path based heuristics is very effective
when the grain size of the instructions is well-matched to
inter-processor communication bandwidth, we have found that in the case
of limited bandwidth, a grain size adjustment phase is required to make
the list-scheduling approach effective.⁸

6.2 Software Pipelining 

Software pipelining [13] optimizes a particular fixed size loop structure
such that several iterations of the loop are started on different
processors at constant intervals of time. This increases the throughput
of the computation. The effectiveness of software pipelining will be
determined by whether the grain size of the parallelism expressed in the
looping structure employed by the programmer matches the architecture:
software pipelining cannot parallelize a computation that has its
parallelism hidden behind inherently sequential data references and
spread across multiple loops. The partial-evaluation approach on such a
loop structure would result in the loop being completely unrolled with
all of the sequential data structure references removed and all of the
fine-grain parallelism in the loop’s computation exposed and available
for parallelization. In some applications, especially those involving
partial differential equations, fully unrolling loops may generate
prohibitively large programs. In these situations, partial evaluation
could be used to optimize the innermost loops of a computation, with
techniques such as software pipelining used to handle the outer loops.

6.3 Vectorizing 

Vectorizing is a commonly used optimization for vector supercomputers,
executing operations on each vector element in parallel. This technique
is highly effective provided that the computation is composed primarily
of readily identifiable vector operations (such as dot-product). Most
vectorizing compilers generate vector code from a scalar specification by
recognizing certain standard looping constructs. However, if the source
program lacks the necessary vector-accessing loop structure, vectorizing
performs very poorly. For computations that are mostly data-independent,
the combination of partial evaluation with static scheduling techniques
has the potential to be vastly more effective than vectorization. Whereas
a vectorizing compiler will often fail simply because the computation’s
structure does not lend itself to a vector-oriented representation, the
partial-evaluation/static scheduling approach can often succeed by making
use of very fine-grained parallelism. On the other hand, for computations
that are highly data-dependent, or which have a highly irregular
structure that makes unrolling loops infeasible, vectorizing remains an
important option.

⁸ The partial-evaluation phase of our compiler is currently not very well
automated, requiring that the programmer provide the compiler with a set
of input data structures for each data-independent code sequence, as if
the data-independent sequences are separate programs being glued together
by the data-dependent conditional branches. This manual interface to the
partial evaluator is somewhat of an implementation quirk; there is no
reason that it could not be more automated. Indeed, several Supercomputer
Toolkit users have built code generation systems on top of our compiler
that automatically generate complete programs, including data-dependent
conditionals, invoking the partial evaluator to optimize the
data-independent portions of the program. Recent work by Weise, Ruf, and
Katz [10, 11] describes additional techniques for automating the
partial-evaluation process across data-dependent branches.

6.4 Iterative Restructuring 

Iterative restructuring represents the manual approach to
parallelization. Programmers write and rewrite their code until the
parallelizer is able to automatically recognize and utilize the available
parallelism. There are many utilities for doing this, some of which are
discussed in [16]. This approach is not flexible in that whenever one
aspect of the computation is changed, one must ensure that parallelism in
the changed computation is fully expressed by the loop and data-reference
structure of the program.

6.5 Static Scheduling 

Static scheduling of the fine-grained parallelism embedded in large basic
blocks has also been investigated for use on the Oscar architecture at
Waseda University in Japan [7]. The Oscar compiler uses a technique
called task fusion that is similar in spirit to the grain size adjustment
technique used on the Supercomputer Toolkit. However, the Oscar compiler
lacks a partial-evaluation phase, leaving it to the programmer to
manually generate large basic blocks. Although the manual creation of
huge basic blocks (or of automated program generators) may be practical
for computations such as an FFT that have a very regular structure, it is
not a reasonable alternative for more complex programs that require
abstraction and complex data structure representations. For example,
imagine writing out the 11,000 floating-point operations for the Stormer
integration of the Solar system and then suddenly realizing that you need
to change to a different integration method. The manual coder would
grimace, whereas a programmer writing code for a compiler that uses
partial evaluation would simply alter a high-level procedure call.

7 Conclusions 

Partial evaluation has an important role to play in the parallel
compilation process, especially for largely data-independent programs
such as those associated with numerically-oriented scientific
computations. Our approach of adjusting the grain size of the computation
to match the architecture was possible only because of partial
evaluation: if we had taken the more conventional approach of using the
structure of the program to detect parallelism, we would then be stuck
with the grain size provided us by the programmer. By breaking down the
program structure to its finest level, and then imposing our own program
structure (regions) based on locality of reference, we have the freedom
to choose the grain size to match the architecture. The coupling of
partial evaluation with static scheduling techniques in the Supercomputer
Toolkit compiler also eliminates the need to write programs in an obscure
style that makes parallelism more apparent.